EEA, 2 March 2019

What will we talk about?

We highlight several efficiencies

  • R Markdown is our one-stop-shop solution
  • R Markdown is inherently reproducible
  • R & R Markdown are calibrated for collaboration
  • R & R Markdown are free
  • BIG Thank you to Project TIER and the Alfred P. Sloan Foundation

Challenges confronted?

  • Learning to code
  • Learning curve
  • Stata and dyndoc
  • Class size

Case Studies

St. Lawrence U.


Replicability & Reproducibility

  • Replicability in economics is a problem, see Camerer et al. 2016; Christensen and Miguel 2016
  • Also bad in psychology & affiliated social sciences (Open Science Collaboration 2015)
  • Relates to the problem of reproducibility (a precondition for replication)
  • "spreadsheet error" underpinning Reinhart and Rogoff (2010); see Herndon, Ash, and Pollin (2014)

What is R Markdown?

  • R Markdown (.Rmd) is a file format
  • Use a source file to produce an output file
  • Source file contains:
      • prose
      • analysis
  • Provides many kinds of output

What is R Studio?

R Studio IDE


R Markdown & Reproducibility

  • What happens in a traditional research report?
  • Are traditional research reports easily reproducible?
  • What gives us one-stop-shop reproducibility?
  • Answer: R Markdown
  • Works within the TIER framework

Traditional Reports

Courtesy of Bray, 2016


Good

  • familiar format, e.g. Word
  • easy learning curve

Bad

  • tough for reproducibility
  • difficult to update
  • mistakes crop up
  • teams can't collaborate easily

Ugly?

  • Word/GDocs = Ugly?

R Markdown Report/Notebook?

Courtesy of Bray, 2016


Good

  • easy to reproduce
  • easy to edit/update
  • easy to collaborate
  • standardized & fast

Bad

  • students must learn syntax
  • code must be error-free to compile

Ugly?

  • inequality in student backgrounds

Text Formatting

# Header 1

## Header 2

### Header 3

This is normal sized text used in the body of our work. 

For bullet points, we use dashes, e.g. 

- Intro to RStudio
- More content
  - a sub-point
- Back to the original level

Document Types

R Markdown can produce a variety of document types (other than the default html page):

  • pdf_document makes a PDF with LaTeX (.pdf)

  • word_document for Microsoft Word documents (.docx).

  • odt_document for OpenDocument Text documents (.odt).

  • rtf_document for Rich Text Format documents (.rtf)

And others.
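As a rough sketch of how one source file can feed several output types, the R script below writes a tiny .Rmd file and renders it twice. It assumes the rmarkdown package is installed; the file name `demo_report.Rmd` and its contents are made up for illustration, and the rendering step is skipped when pandoc (bundled with RStudio) is not available.

```r
library(rmarkdown)

# Write a minimal source file: a YAML header, some prose, and one code chunk.
rmd_path <- file.path(tempdir(), "demo_report.Rmd")  # hypothetical file name
writeLines(c(
  "---",
  "title: \"Demo report\"",
  "output: html_document",
  "---",
  "",
  "Stopping distance rises with speed in the built-in `cars` data.",
  "",
  "```{r cars-summary}",
  "summary(cars)",
  "```"
), rmd_path)

# Render the same source to two document types.
# Rendering needs pandoc, so skip this step if pandoc is not installed.
formats <- c("html_document", "word_document")
outputs <- character(0)
if (pandoc_available()) {
  outputs <- vapply(
    formats,
    function(fmt) render(rmd_path, output_format = fmt, quiet = TRUE),
    character(1)
  )
}
print(outputs)  # paths of the rendered .html and .docx files, if any
```

The `output_format` argument overrides the `output:` field in the header, so the source file itself does not change between formats.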

Presentation Types

R Markdown can also be re-purposed to produce a presentation file (as with this presentation):

  • ioslides_presentation opens in your browser and is interactive (.html)

  • slidy_presentation is another browser-based presentation format (.html)

  • beamer_presentation makes a PDF with LaTeX (.pdf)

Data work

Think about data analysis as falling into three loose categories:

  • management & wrangling
  • visualization & summary statistics
  • modeling & inference

All of this occurs in the code "chunk"
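A minimal sketch of the three categories, using only base R and the built-in `cars` data (speed in mph, stopping distance in feet). The derived column name `speed_kmh` is an illustrative choice, not something from the course materials.

```r
# 1. Management & wrangling: add a metric-unit column to the built-in data.
cars_metric <- transform(cars, speed_kmh = speed * 1.609344)  # mph to km/h

# 2. Visualization & summary statistics.
print(summary(cars_metric$dist))
plot(dist ~ speed_kmh, data = cars_metric,
     xlab = "Speed (km/h)", ylab = "Stopping distance (ft)")

# 3. Modeling & inference: a simple linear regression.
fit <- lm(dist ~ speed_kmh, data = cars_metric)
print(summary(fit))
```

Placed inside a single code chunk, all three steps rerun every time the document is rendered.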

Code chunks

  • To open a code chunk hit CMD + OPTION + I on a Mac

  • Or type out three backticks ``` followed by {r}

  • And then three more backticks ``` on another line.

  • Within the braces you can specify options, like {r, eval = FALSE} if you don't want the code to be evaluated

  • Or you can label the code chunk, e.g. {r cars} labels the chunk "cars" in R Studio's chunk navigation menu
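To see what the chunk options do, one option is to knit a few lines of R Markdown source directly from R. This sketch assumes the knitr package is installed; the chunk labels `shown` and `computed` are made up for illustration.

```r
library(knitr)

# Two chunks with opposite settings:
#   shown:    code is displayed (echo = TRUE) but not run (eval = FALSE)
#   computed: code is run but hidden (echo = FALSE), so only the result appears
src <- c(
  "```{r shown, echo = TRUE, eval = FALSE}",
  "1 + 1",
  "```",
  "",
  "```{r computed, echo = FALSE}",
  "1 + 1",
  "```"
)

# knit() with text = ... returns the knitted markdown as a character string.
out <- knit(text = src, quiet = TRUE)
cat(out)
```

The knitted output should contain the source line from the first chunk and only the computed result from the second.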

Code Chunk: Example

```{r cars, echo = TRUE}
summary(cars)
```

The option echo = TRUE means that the code gets included in the rendered html.

summary(cars)
##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00

Take-home messages

  • Free: Because the R Markdown package suite is free, students can use it on their own laptops or access R Studio Server (efficiency in costs)
  • On demand: free means available whenever and wherever students work: over break, on weekends, in a dorm room (a massive issue with Stata/SPSS)
  • Workflow & suite: Everything in one place & exportable to different outputs (package suite & workflow efficiency)
  • Reproducibility: Enables reproducible work through R Studio & Rmd basics, helped by coaching on file structure & documentation (efficiency in workflow)

Counters?

  1. "R and R Markdown are open source and do not have the interests of private, profit-maximizing firms at their core; they will therefore be necessarily inferior to Stata, SAS and private sector alternatives."
  2. "Stata has released dyndoc, so I don't need to learn R Markdown."
  3. "I cannot get the equivalent of [Command X] that I use in [Statistical Package Y]. I can't work without [Command X]."
  4. "I don't know R and there is a fixed cost to switching. It is irrational for me to incur that cost without subsidization from [my department/college/profession]."
  5. "This might work in small classes, but not in my 100-student class."

What else?

R Markdown and R Studio together have excellent capabilities.

  • R Studio can show you the output of the commands within the R Markdown file
  • R Studio has error-detection and debugging assistance for your code (unlike, e.g., Stata or aspects of Excel)
  • R Studio Server can be hosted online, and your students work there with their own logins

Conclusion

  • Adopting R Markdown \(\uparrow\) classroom efficiency
  • Tradeoffs? Content coverage?
  • Externalities? Greater success in later courses/theses
  • Students more data literate
  • Integrity & economic research w/ reproducibility

Slide with Plot (Reproduction of Sutter, 2009)

Dynamic Graphs

Plotly Graphs

Alter and check some data

##   session subject  r1  r2  r3  r4  r5  r6  r7  r8  r9  treatment team
## 1       1       1   0   0   0   0  10  10   0   0   0 individual   NA
## 2       1       2   0   0  30  40  40   0   0   0  20 individual   NA
## 3       1       3  30  30   0   0   0  60  60  10   0 individual   NA
## 4       1       4  20   0 100   0   0  30  75 100 100 individual   NA
## 5       1       5 100 100 100 100 100 100 100 100 100 individual   NA
## 6       1       6 100 100 100 100 100 100 100   0   0 individual   NA
##           uniqid
## 1 1_individual_1
## 2 1_individual_2
## 3 1_individual_3
## 4 1_individual_4
## 5 1_individual_5
## 6 1_individual_6
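The regressions that follow use one `value` per row, so the wide data (one column per round, `r1` to `r9`) presumably gets reshaped to long format first. The sketch below shows one way to do that in base R, using the first two rows printed above. The `paste()` rule for `uniqid` is inferred from the printed values, and the column name `round` is an assumption; the original code is not shown in the slides.

```r
# First two rows of the printed data; the full data set is not reproduced here.
wide <- data.frame(
  session = c(1, 1),
  subject = c(1, 2),
  r1 = c(0, 0),   r2 = c(0, 0),  r3 = c(0, 30),
  r4 = c(0, 40),  r5 = c(10, 40), r6 = c(10, 0),
  r7 = c(0, 0),   r8 = c(0, 0),  r9 = c(0, 20),
  treatment = c("individual", "individual")
)

# Build a unique subject id (pattern inferred from the printed uniqid column).
wide$uniqid <- paste(wide$session, wide$treatment, wide$subject, sep = "_")

# Wide to long: one row per subject and round, with the amount in `value`.
round_cols <- paste0("r", 1:9)
long <- reshape(wide, direction = "long",
                varying = round_cols, v.names = "value",
                timevar = "round", times = 1:9, idvar = "uniqid")
long <- long[order(long$uniqid, long$round), ]
print(head(long))
```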

Statistical Tests

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  value by treatment
## W = 52876, p-value = 3.838e-10
## alternative hypothesis: true location shift is not equal to 0

Regression output

## 
## Call:
## lm(formula = value ~ treatment, data = SutNarrow)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -61.370 -29.385  -0.542  38.630  60.615 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          39.385      1.451  27.152  < 2e-16 ***
## treatmentmessage     21.985      1.994  11.028  < 2e-16 ***
## treatmentmixed       10.609      1.925   5.510 3.92e-08 ***
## treatmentpaycomm     10.886      2.144   5.077 4.09e-07 ***
## treatmentteamtreat   16.313      2.629   6.204 6.34e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 34.81 on 2713 degrees of freedom
## Multiple R-squared:  0.04473,    Adjusted R-squared:  0.04333 
## F-statistic: 31.76 on 4 and 2713 DF,  p-value: < 2.2e-16
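The calls behind output like this are short. Because the experimental data are not included in the slides, the sketch below runs the same two kinds of call on simulated data; the data frame `sim`, its values, and the seed are all made up and say nothing about the actual results.

```r
set.seed(2019)  # arbitrary seed so the simulated numbers are repeatable

# Simulated long-format data (NOT the experimental data): 180 observations,
# 90 in each of two hypothetical treatment arms.
sim <- data.frame(
  treatment = factor(rep(c("individual", "message"), each = 90)),
  value     = c(sample(0:100, 90, replace = TRUE),
                sample(20:100, 90, replace = TRUE))
)

# Non-parametric comparison of the two arms (normal approximation,
# since the simulated values contain ties).
rank_test <- wilcox.test(value ~ treatment, data = sim, exact = FALSE)

# OLS with a treatment dummy, mirroring lm(value ~ treatment, data = SutNarrow).
fit <- lm(value ~ treatment, data = sim)

print(rank_test)
print(summary(fit))
```

With a single categorical regressor, the intercept is the mean of the baseline arm and the coefficient is the difference in means between arms.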

Or a Panel Regression

## Oneway (time) effect Random Effect Model 
##    (Swamy-Arora's transformation)
## 
## Call:
## plm(formula = value ~ treatment, data = SutNarrow, effect = "time", 
##     model = "random", index = c("uniqid"))
## 
## Balanced Panel: n = 302, T = 9, N = 2718
## 
## Effects:
##                    var  std.dev share
## idiosyncratic 1197.631   34.607 0.987
## time            16.062    4.008 0.013
## theta: 0.555
## 
## Residuals:
##     Min.  1st Qu.   Median  3rd Qu.     Max. 
## -61.9325 -28.6639  -2.5889  33.7276  64.3014 
## 
## Coefficients:
##                    Estimate Std. Error z-value  Pr(>|z|)    
## (Intercept)         39.3854     1.9657 20.0367 < 2.2e-16 ***
## treatmentmessage    21.9850     1.9818 11.0936 < 2.2e-16 ***
## treatmentmixed      10.6093     1.9140  5.5430 2.973e-08 ***
## treatmentpaycomm    10.8862     2.1315  5.1072 3.270e-07 ***
## treatmentteamtreat  16.3130     2.6138  6.2412 4.342e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Total Sum of Squares:    3403100
## Residual Sum of Squares: 3249200
## R-Squared:      0.045244
## Adj. R-Squared: 0.043836
## Chisq: 128.563 on 4 DF, p-value: < 2.22e-16

Even Fancy Regression Output

                      Dependent variable: value
treatmentmessage      21.985*** (1.994)
treatmentmixed        10.609*** (1.925)
treatmentpaycomm      10.886*** (2.144)
treatmentteamtreat    16.313*** (2.629)
Constant              39.385*** (1.451)
Observations          2,718
R2                    0.045

Cases: Swoboda

Econometrics

  • Basics of slides w/ math & models w/o teaching to students (yet; coming soon!)

Senior Seminar in Urban Economics

  • 15 Students w/ reproducible final project
  • Reproduce work in published papers with 'problems' as pedagogy (what happens if you need to update your data?)
  • Create custom progress reports for students

Micro

  • Interactive shiny apps for students

Cases: Halliday

Behavioral Economics

  • 20 - 35 Students (taught twice)
  • Slides, notes and assignments in Rmd & html
  • Students do a reproducible research project in teams
  • Mid-term in Rmd & final project in Rmd
  • Fully reproduced work in 3/5; partially in 2/5
  • Mid-way through second time

Special Studies

  • Reproducible project on job satisfaction, voice, autonomy
  • Reproducible project on Alaska permanent fund and welfare

Cases: Dvorak

Business Analytics

  • 25 - 35 Students; upper-level
  • 3 main outcomes:
      • ability to manipulate data
      • ability to analyze data
      • ability to formulate questions that can be answered using data
  • use train & test regimen from data science and machine learning
  • Homework & exams in Rmd; simulate 'real world' with access to internet & notes
  • Students perform better in senior theses

Honors theses

  • See what Michael says about theses; similar ideas apply

Lessons from experience

Michael:

Students will only learn commands through graded assignments

Aaron:

Students can struggle with basic computing (working directory, etc.)

Lessons from Simon

Students have to adjust to getting the Basics Right

  • file paths
  • script vs. markdown
  • source vs. output

Students know WYSIWYG

  • MS Word & G docs are WYSIWYG, but Rmd is not.

Installing packages

  • analogy: you have to install apps (packages) to do different things on your device (RStudio & R)
  • Chrome extensions

  • R Studio Server = GOOD & free

Math?

How about Bayes' Rule?

\[\Pr(\mbox{Outcome} \mid \mbox{signal}) = \frac{\theta p}{\theta p + (1 - \theta)(1 - p)}\]

R Markdown uses \(\LaTeX\) for math and it immediately gets displayed in R Studio.

That is, \(\LaTeX\) without the challenges of learning the packages, tables, etc. that make learning \(\LaTeX\) so hard.

In-line equations are bracketed by single dollar signs $.

Off-set equations are bracketed by double dollar signs $$.
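As a worked illustration of both delimiters, here is how Bayes' rule with numbers plugged in might be typed in an .Rmd source file. Reading \(\theta\) as the prior probability of the outcome and \(p\) as the accuracy of the signal is an assumption, as are the values \(\theta = 0.3\) and \(p = 0.8\).

```latex
With prior $\theta = 0.3$ and signal accuracy $p = 0.8$:

$$
\Pr(\mbox{Outcome} \mid \mbox{signal})
  = \frac{\theta p}{\theta p + (1 - \theta)(1 - p)}
  = \frac{0.24}{0.24 + 0.14}
  \approx 0.63
$$
```

The single-dollar expressions render in-line with the sentence, while the double-dollar block renders as a centered display equation.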

Literature?

  • McGoldrick (2008) - senior empirical projects
  • Imazeki (2014) - data literacy & team projects & "real-world problem"
  • Ball & Medeiros (2013) - TIER protocol for teaching
  • Shapiro & Gentzkow (2014) - reproducible research in Econ (with RAs)
  • Knittel & Metaxoglou (forthcoming, JEM) - a methodology for econometric work
  • Baumer et al. (2013) - Stats & R Markdown

Acknowledgments

R & Rmd Link Love?